Skip to content

verifier: mkdir -p /app before pytest#202

Open
xdotli wants to merge 1 commit intomainfrom
fix/verifier-mkdir-app
Open

verifier: mkdir -p /app before pytest#202
xdotli wants to merge 1 commit intomainfrom
fix/verifier-mkdir-app

Conversation

@xdotli
Copy link
Copy Markdown
Member

@xdotli xdotli commented Apr 25, 2026

Summary

VERIFIER_ENV pins PYTEST_ADDOPTS="--rootdir=/app …" (see _sandbox.py:307) for test-node-ID anchoring across SkillsBench / TB-style tasks. Tasks whose Dockerfile WORKDIRs elsewhere (e.g. /root) and don't otherwise create /app trip pytest's at-startup directory check:

ERROR: Directory '/app' not found. Check your '--rootdir' option.

The verifier aborts before reaching test_outputs.py and the trial scores 0 even when the agent produced correct output.

Surfaced today by trajectory analysis of the SkillsBench Apr 2026 patch trial:

  • pg-essay-to-audiobook wrote /root/audiobook.mp3 (29:45 MP3, valid ffmpeg output) and the verifier's pytest still aborted on the missing /app rootdir.
  • Same shape on scheduling-email-assistant and several other Trial 1 ERROR-classified tasks.

skill_eval's Dockerfile augmentation already does this (src/benchflow/skill_eval.py:325 runs mkdir -p /logs/verifier /logs/agent /logs/artifacts /app /tests). This PR adds the equivalent to the regular eval path so all benchflow runs get /app, not just skill_eval ones.

Idempotence

mkdir -p is a no-op when the directory already exists. We deliberately don't chmod /app so any task content (e.g. swebench-style repo at /app/<repo>) stays root-owned and undisturbed.

Test plan

  • Re-run pg-essay-to-audiobook and scheduling-email-assistant from the SkillsBench Apr 2026 patch trial — verifier should now reach test_outputs.py instead of aborting at pytest's rootdir check.
  • Existing tasks with WORKDIR /app (e.g. azure-bgp-oscillation-route-leak, dialogue-parser) should be unaffected — mkdir -p is a no-op there.
  • No change to skill_eval flow (it was already doing this).

Open in Devin Review

VERIFIER_ENV pins PYTEST_ADDOPTS=--rootdir=/app for test-node-ID
anchoring across SkillsBench / TB-style tasks. Tasks whose Dockerfile
WORKDIRs elsewhere (e.g. /root) and don't otherwise create /app then
trip pytest's at-startup directory check:

  ERROR: Directory '/app' not found. Check your '--rootdir' option.

Verifier aborts before reaching test_outputs.py and the trial scores
0 even when the agent produced correct output. Surfaced today by
trajectory analysis of the SkillsBench Apr 2026 patch trial:
pg-essay-to-audiobook wrote /root/audiobook.mp3 (29:45 MP3, valid
ffmpeg output) and the verifier's pytest still aborted on the
missing /app rootdir; same shape on scheduling-email-assistant.

Mirrors the skill_eval Dockerfile pattern (skill_eval.py:325) which
already RUNs 'mkdir -p /logs/verifier /logs/agent /logs/artifacts
/app /tests'. Add /app to the verifier dir prep step in trial.py so
all benchflow eval paths (not just skill_eval) get it.

Idempotent: tasks that DO populate /app are unaffected — mkdir -p
is a no-op when the directory exists, and we deliberately don't
chmod /app (any pre-existing task content stays root-owned).
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 2 additional findings in Devin Review.

Open in Devin Review

Comment thread src/benchflow/trial.py
Comment on lines 554 to 557
await self._env.exec(
"rm -rf /logs/verifier && mkdir -p /logs/verifier && chmod 777 /logs/verifier",
"rm -rf /logs/verifier && mkdir -p /logs/verifier /app && "
"chmod 777 /logs/verifier",
user="root", timeout_sec=10,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 mkdir -p /app fix applied to soft_verify but not to the full verify path's _CLEAR_VERIFIER_DIR_CMD

The PR adds mkdir -p /app to soft_verify() to prevent pytest from aborting with "Directory '/app' not found" when VERIFIER_ENV pins PYTEST_ADDOPTS=--rootdir=/app and the task's Dockerfile WORKDIRs elsewhere (e.g. /root). However, the same issue exists in the full verify() path: sdk._verify() calls harden_before_verify() at src/benchflow/sdk.py:403, which executes _CLEAR_VERIFIER_DIR_CMD at src/benchflow/_sandbox.py:717 — and that constant at src/benchflow/_sandbox.py:593-595 does not include mkdir -p /app. Since harden_before_verify() also sets PYTEST_ADDOPTS with --rootdir=/app (via VERIFIER_ENV at src/benchflow/_sandbox.py:304-308), the same pytest abort will occur during the final verify() for any task whose Dockerfile doesn't create /app.

Prompt for agents
The mkdir -p /app fix in soft_verify() at src/benchflow/trial.py:555 addresses a real issue where pytest aborts with 'Directory /app not found' when VERIFIER_ENV sets --rootdir=/app but the task Dockerfile WORKDIRs elsewhere. However, the same fix needs to be applied to the full verify() path.

The constant _CLEAR_VERIFIER_DIR_CMD at src/benchflow/_sandbox.py:593-595 is used by harden_before_verify() at line 717, which is called from sdk._verify() at src/benchflow/sdk.py:403. This constant should also include 'mkdir -p /app' to ensure /app exists before the verifier runs pytest with --rootdir=/app.

Suggested fix: Update _CLEAR_VERIFIER_DIR_CMD in src/benchflow/_sandbox.py from:
  "rm -rf /logs/verifier && mkdir -p /logs/verifier && chmod 777 /logs/verifier"
to:
  "rm -rf /logs/verifier && mkdir -p /logs/verifier /app && chmod 777 /logs/verifier"
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant